Abstract

In this paper, we present new algorithms and data structures for nearest neighbor searching
where the input points are exact and the query point is uncertain under the L_1 distance
metric. The uncertain query point is represented by a discrete probability density function (pdf),
and the goal is to return the expected nearest neighbor, which minimizes the expected distance
to the query point. Given a set of n exact points in the plane, we build an O(n log n log log n)-size
data structure in O(n log n log log n) time such that for any uncertain query point with k
possible locations, the expected nearest neighbor can be found in O(k log^2 n + k log k) time.
The previously best method (in PODS 2012) for this problem requires O(n log^2 n) preprocessing
time, O(n log^2 n) space, and O(k^2 log^3 n) query time. In addition, for the one-dimensional
version of this problem, we build an O(n)-size data structure in O(n log n) time that can support
O(k + log n) time queries.

Nearest neighbor searching is a fundamental and well-studied problem in computational geometry,
due to its wide range of applications in databases, computer vision, image processing, information
retrieval, pattern recognition, etc. [HUD]. In general, for a set P of points in the d-dimensional space R^d, the
problem asks for a data structure to quickly report the nearest neighbor in P for any query point.

In many applications, e.g., face recognition and sensor networks, data is inherently imprecise due
to various reasons, such as noise or multiple observations. Numerous classic problems, including
clustering [12], skylines [1], range queries [2], and nearest neighbor searching r2B], have been
recast and studied under uncertainty in the past few years. In this paper, we are also interested in
nearest neighbor searching on uncertain data. Further, we focus on distances measured
by the L_1 metric, which is appropriate for certain applications such as VLSI design automation.


1.1 The Problem Statement, Previous Work, and Our Results 

An uncertain point Q in the d-dimensional space R^d (for d ≥ 1) is represented as a discrete probability
density function (pdf) f_Q : Q → [0, 1]. Instead of having one exact location, Q has a set of k
possible locations: Q = {q_1, ..., q_k}, where q_i has probability w_i = f_Q(q_i) ≥ 0 of being the true
location of Q, and Σ_{i=1}^{k} w_i = 1. Throughout the paper, we use k to denote the number of
possible locations of any uncertain point Q; k is also known as the description complexity of Q [3].

For any two exact points p and q in R^d, denote by d(p, q) the distance between p and q. For any exact
point p and any uncertain point Q, their expected distance, denoted by Ed(p, Q), is defined to be

\[ Ed(p, Q) = \sum_{i=1}^{k} w_i \cdot d(p, q_i). \]

Let P be a set of n exact points in R^d. For any uncertain query point Q, the expected nearest
neighbor (ENN) of Q in P, denoted by ψ(P, Q), is

\[ \psi(P, Q) = \arg\min_{p \in P} Ed(p, Q). \]

In other words, ψ(P, Q) is a point of P whose expected distance to Q is minimum among all points
in P.

Given a set P of n exact points in R^d, the expected nearest neighbor searching (or ENN searching)
with uncertain query problem is to design a data structure for P to quickly report the ENN of
Q in P for any uncertain query point Q.

In this paper, we consider the ENN searching with uncertain query problem in the plane (i.e.,
d = 2). Further, we focus on the L_1 distance, i.e., the distance d(p, q) is measured by the L_1 metric.
Specifically, for any two exact points p and q, suppose their coordinates are (p_x, p_y) and (q_x, q_y),
respectively; then d(p, q) = |p_x - q_x| + |p_y - q_y|.
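
The definitions above can be sketched as a brute-force baseline that evaluates Ed(p, Q) for every input point. This is only an illustration of the problem statement, not the data structure of this paper; the names `l1`, `expected_distance`, and `brute_force_enn` are hypothetical, and the query is assumed to be given as parallel lists of locations and weights.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def l1(p: Point, q: Point) -> float:
    """L_1 distance d(p, q) = |p_x - q_x| + |p_y - q_y|."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def expected_distance(p: Point, locs: List[Point], ws: List[float]) -> float:
    """Ed(p, Q) = sum_i w_i * d(p, q_i) for the uncertain point Q = (locs, ws)."""
    return sum(w * l1(p, q) for q, w in zip(locs, ws))


def brute_force_enn(P: List[Point], locs: List[Point], ws: List[float]) -> Point:
    """Return a point of P minimizing Ed(p, Q); costs O(n k) time per query."""
    return min(P, key=lambda p: expected_distance(p, locs, ws))
```

Such a baseline is mainly useful for checking faster query algorithms on small inputs.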

The L_1 ENN searching problem in the plane has been studied by Agarwal et al. [3], where an
O(n log^2 n)-size data structure is constructed in O(n log^2 n) time such that each ENN query can
be answered in O(k^2 log^3 n) time. In this paper, we give a new data structure for the problem.
The data structure can be built in O(n log n log log n) time and O(n log n log log n) space, and the
query time is O(k log^2 n + k log k). Our data structure is based on new observations about
the problem, as well as advanced data structures, e.g., the compact interval trees
[14] and the segment-dragging query data structure [6]. Compared with the previous work in [3],
our data structure has smaller preprocessing time and space, and a smaller query time.

In addition, we also present data structures for the ENN searching in the one-dimensional space
(i.e., d = 1), under either the L_1 or the L_2 metric (i.e., the Euclidean metric). Note that in the 1-D space,
the L_1 metric is the same as the L_2 metric. For the L_2 metric, only approximation results have
been given in the high-dimensional space when d ≥ 2, e.g., [3] [17]. In contrast, we present an exact
data structure for the 1-D case with O(n log n) preprocessing time and O(n) space, and each query
can be answered in O(k + log n) time.

1.2 Related Work 

Different models have been proposed for the nearest neighbor searching under uncertainty. 

In the model of probabilistic nearest neighbor (PNN), each input point in P is an uncertain 
point that has probabilities to appear at certain locations. For any query point, one can look 
at the probability of each input point being the nearest neighbor of the query point. The main 
drawback of PNN is that it is computationally expensive: the nearest neighbor not only depends 
on the query point, but also depends on the probabilities of all input points. The model has been 
widely studied [5j El 13 13 [23 [26] . All of these methods were R-tree based heuristics and 
did not provide any guarantee on the query time in the worst case. For instance, Cheng et al. [7] 
studied the PNN query that returns those uncertain points whose probabilities of being the nearest 
neighbor are higher than some threshold, allowing some given errors in the answers. 

In the model of superseding nearest neighbor (SNN) [26], given a query point, one can look at
the superseding relationship of each pair of input points: one supersedes the other if and only if it
has probability more than 0.5 of being the nearest neighbor of the query point, where the probability
computation is restricted to this pair of points. One can return the point, if such a point exists, which
supersedes all the others. Otherwise, one returns the minimal set S of data points such that any
data point in S supersedes any data point not in S.

For the model of ENN, one looks at the expected distance from each data point to the query
point. Since the expected distance of any input point only depends on the query point, efficient data
structures are available. Ljosa et al. [18] investigated the expected k-NN under the L_1 metric
and obtained an ε-approximation. Recently, Agarwal et al. [3] gave the first nontrivial methods for
answering exact or approximate expected nearest neighbor queries under various distance functions
(e.g., L_1, L_2, and the squared Euclidean distance) with provable performance guarantees. Efficient
data structures are also provided in [3] when the input data is uncertain and the query data is
exact. It should be noted that ENN is not a good indicator under large uncertainty (e.g., refer to
[26] for an explanation).

When the input points are exact and the query point is uncertain, ENN is the same as the
weighted version of the Sum aggregate nearest neighbors (ANN), which is a generalization of the
Sum ANN. Only heuristic algorithms are known for answering Sum ANN queries [16] [17] [19] [20]
[21] [24] [25]. The best known heuristic algorithm for exact (weighted) Sum ANN queries is an R-tree
based MEM method [21], and Li et al. [17] gave a data structure with 3-approximation query
performance for the Sum ANN. Agarwal et al. [3] gave a data structure with a polynomial-time
approximation scheme for the ENN queries under the Euclidean distance metric, which also works
for the Sum ANN queries.

The rest of the paper is organized as follows. In Section 2, we give our results in the
one-dimensional space, which are generalized to the two-dimensional space in Section 3. One may view
Section 2 as a "warm-up" for Section 3. Section 4 concludes the paper.

For simplicity of discussion, we make a general position assumption that no two points in P ∪ Q
have the same x- or y-coordinate for any query Q. Our techniques can be extended to the general
case. In the rest of the paper, we always use Q as the uncertain query point. To simplify the
notation, we will write Ed(p) for Ed(p, Q), and ψ(P) for ψ(P, Q). For any subset P' ⊆ P, denote
by ψ(P') (or ψ(P', Q)) the ENN of Q in P'. For any point q ∈ Q, let w(q) denote the probability
of Q being located at q. Although Σ_{q∈Q} w(q) = 1, as a theoretical generalization, our techniques
also work for the case where Σ_{q∈Q} w(q) ≠ 1. We simply define W = Σ_{q∈Q} w(q).

2 The ENN Searching on the Real Line 

In the 1-D case, all input points in P are on a real line L. We assume L is the x-axis in the
plane. Consider any uncertain query point Q = {q_1, ..., q_k} on L. For any point p on L, denote
by x(p) the coordinate of p on L. Our goal for the query Q is to find ψ(P), which is a point p in
P minimizing the expected distance Ed(p) = Σ_{q∈Q} w(q) d(p, q), where d(p, q) = |x(p) - x(q)|.

A point p on L is a global minimum if it minimizes the expected distance Ed(p) among all points
on L. Note that the global minimum point on L may not be unique.

To find ψ(P), we use the following strategy. First, we find a global minimum point p* on L.
Second, the point p* partitions L into two half-lines, and for each half-line, we find the point p in
P on the half-line that is closest to p*; we claim that whichever of these two points has the smaller
expected distance to Q is ψ(P). The details are given below. We first show how to find p*.

Note that the points in Q ordered by their indices may not be sorted by their coordinates
on L. Recall that W = Σ_{q∈Q} w(q). Let q* be the point in Q such that

\[
\sum_{x(q) < x(q^*),\; q \in Q} w(q) < W/2 \quad \text{and} \quad w(q^*) + \sum_{x(q) < x(q^*),\; q \in Q} w(q) \ge W/2.
\]

In other words, if we view w(q) as the weight of x(q) for each q ∈ Q, then x(q*) is the weighted
median of the set {x(q) | q ∈ Q}. We claim that q* is a global minimum point on L. In order
to prove the claim, we first prove the following Lemma 1.

Lemma 1 For any point p on L with p ≠ q*, if we move p on L towards q*, the expected distance
Ed(p) is monotonically decreasing.

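Proof: Without loss of generality, assume p is on the left side of q* and we move p on L to the
right towards q*. The case where p is on the right side of q* can be analyzed similarly. At any
moment during the movement of p, let Q_L = {q | q ∈ Q and x(q) ≤ x(p)} and let Q_R = Q \ Q_L.
According to the definition of Ed(p), we have

\[
\begin{aligned}
Ed(p) &= \sum_{q \in Q} w(q) \cdot |x(p) - x(q)| = \sum_{q \in Q_L} w(q) \cdot (x(p) - x(q)) + \sum_{q \in Q_R} w(q) \cdot (x(q) - x(p)) \\
&= \Big[ \sum_{q \in Q_L} w(q) - \sum_{q \in Q_R} w(q) \Big] \cdot x(p) - \sum_{q \in Q_L} w(q) \cdot x(q) + \sum_{q \in Q_R} w(q) \cdot x(q).
\end{aligned}
\]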

Because p is to the left of q*, according to the definition of q*, Σ_{q∈Q_L} w(q) < W/2 < Σ_{q∈Q_R} w(q)
holds. Further, as p moves to the right towards q*, the value x(p) is monotonically increasing.
Hence, as p moves, the first term in the above equation, i.e., [Σ_{q∈Q_L} w(q) - Σ_{q∈Q_R} w(q)] · x(p), is
monotonically decreasing.

As p moves to the right, the set Q_L becomes monotonically larger (i.e., Q_L will have more
points) and Q_R becomes monotonically smaller. Hence, the value Σ_{q∈Q_L} w(q) · x(q) is monotonically
increasing and the value Σ_{q∈Q_R} w(q) · x(q) is monotonically decreasing.

The above discussion leads to the conclusion that as p moves to the right towards q*, the value
Ed(p) is monotonically decreasing. The lemma thus follows. □

Lemma 1 shows that Ed(p) is a convex function with respect to the position of p on L, and
Ed(p) attains its global minimum at p = q*. Hence, we have the following corollary.

Corollary 1 The point q* is a global minimum point on L. 
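
As a small illustration of the definition of q*, the weighted median can be found by scanning the locations in sorted order. This sketch sorts first and therefore takes O(k log k) time rather than the linear time of a weighted selection algorithm; the function names are hypothetical, and positive weights are assumed.

```python
from typing import List


def weighted_median(xs: List[float], ws: List[float]) -> float:
    """Return x(q*): the first coordinate, in sorted order, at which the
    accumulated weight reaches W/2 (the weight strictly to its left is < W/2)."""
    half = sum(ws) / 2
    acc = 0
    for x, w in sorted(zip(xs, ws)):
        if acc + w >= half:  # here acc < W/2 still holds
            return x
        acc += w
    raise ValueError("weights must be positive and the input non-empty")


def ed_1d(x: float, xs: List[float], ws: List[float]) -> float:
    """Expected distance on the line: sum of w(q) * |x - x(q)|."""
    return sum(w * abs(x - xq) for xq, w in zip(xs, ws))
```

On small inputs, one can compare `ed_1d` at the returned coordinate against other positions on the line to observe the behavior stated in Corollary 1.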

Next, we show how to find the ENN ψ(P) with the help of q*.

If q* is also a point in P, then let p_l = p_r = q*; otherwise let p_l be the rightmost point in P
that is to the left of q*, and let p_r be the leftmost point in P that is to the right of q*. In other
words, if q* ∉ P, p_l and p_r are the two adjacent points in the sorted list of P by their coordinates
on L such that x(p_l) < x(q*) < x(p_r). The following lemma follows from Lemma 1.

Lemma 2 The ENN ψ(P) is whichever of p_l and p_r has the smaller expected distance to Q.

Proof: If q* ∈ P, the lemma simply follows since q* is a global minimum point. Otherwise, consider
any point p ∈ P. If x(p) < x(q*), then x(p) ≤ x(p_l) because p_l is the rightmost point to the left
of q*. By Lemma 1, Ed(p) ≥ Ed(p_l). Similarly, if x(p) > x(q*), we can prove Ed(p) ≥ Ed(p_r). The
lemma thus follows. □

According to the above discussion, our query algorithm for finding ψ(P) works as follows: (1)
compute q*; (2) find p_l and p_r; (3) compute Ed(p_l) and Ed(p_r), and report whichever of p_l and p_r has the smaller
expected distance to Q as ψ(P).

In the algorithm above, Step (1) can be done in O(k) time by the weighted selection algorithm
[11]. For Step (2), if we sort all points in P by their coordinates on L as preprocessing, then p_l and
p_r can be found in O(log n) time by binary search. Step (3) can be easily done in O(k) time. We
conclude with the following theorem.






Theorem 1 Given a set P of n exact points on the real line L, with O(n log n) time and O(n)
space preprocessing, the ENN ψ(P) can be found in O(k + log n) time for any uncertain query point
Q on L.
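
The three query steps can be sketched as follows. This is a minimal illustration under stated assumptions: the class name `ENN1D` and its methods are hypothetical, P is assumed non-empty with positive query weights, and Step (1) is done by sorting (O(k log k)) instead of linear-time weighted selection, so the sketch does not attain the O(k + log n) bound of Theorem 1.

```python
import bisect
from typing import List


class ENN1D:
    def __init__(self, points: List[float]):
        # Preprocessing: sort P by coordinate (O(n log n) time, O(n) space).
        self.xs = sorted(points)

    @staticmethod
    def _weighted_median(locs: List[float], ws: List[float]) -> float:
        half, acc = sum(ws) / 2, 0
        for x, w in sorted(zip(locs, ws)):
            if acc + w >= half:
                return x
            acc += w
        raise ValueError("weights must be positive and the query non-empty")

    def query(self, locs: List[float], ws: List[float]) -> float:
        # Step (1): compute q*, the weighted median of the query locations.
        q_star = self._weighted_median(locs, ws)
        # Step (2): binary search for the neighbors p_l and p_r of q* in P.
        i = bisect.bisect_left(self.xs, q_star)
        if i < len(self.xs) and self.xs[i] == q_star:
            candidates = [q_star]            # q* is itself a point of P
        else:
            candidates = self.xs[max(i - 1, 0):i + 1]
        # Step (3): report the candidate with the smaller expected distance.
        def ed(p: float) -> float:
            return sum(w * abs(p - x) for x, w in zip(locs, ws))
        return min(candidates, key=ed)
```

For example, `ENN1D([0, 10, 4, 7]).query([5, 1, 3], [2, 5, 3])` compares only the two neighbors of q* = 1, namely 0 and 4.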

3 The L_1 ENN Searching in the Plane

In this section, we present our results in the two-dimensional space, where the input point set P
and the query point Q are given in the plane.

Our techniques generalize those in Section 2. For any query Q, we first find a global minimum
p* in the plane. Then, in each of the four quadrants with respect to p*, we find the ENN of Q in
P in that quadrant. Unlike in the 1-D case, where binary search is sufficient, the difficulty here
is that it is not easy to find the ENN of Q in each quadrant. To do so, by proving a monotone
property analogous to Lemma 1, we show that the ENN must be on a "skyline" and thus we only need to
search the "skyline". Advanced data structures (e.g., the compact interval trees [14] and
the segment-dragging queries [6]) are also used for efficient implementations. The details are given
below.

Consider any uncertain query point Q = {q_1, q_2, ..., q_k}. For any point p in the plane, denote
by x(p) the x-coordinate of p and by y(p) the y-coordinate of p. Our goal is to find ψ(P), which is a
point p ∈ P that minimizes Ed(p) = Σ_{q∈Q} w(q) d(p, q), where d(p, q) = |x(p) - x(q)| + |y(p) - y(q)|.

A point p in the plane is a global minimum if it minimizes the expected distance Ed(p) among all
points in the plane. Below, we first show how to find a global minimum point.

3.1 Finding a Global Minimum Point 

Recall that W = Σ_{q∈Q} w(q). Let q*_x be the point in Q such that

\[
\sum_{x(q) < x(q^*_x),\; q \in Q} w(q) < W/2 \quad \text{and} \quad w(q^*_x) + \sum_{x(q) < x(q^*_x),\; q \in Q} w(q) \ge W/2.
\]

In other words, if we view w(q) as the weight of x(q) for each q ∈ Q, then x(q*_x) is the weighted
median of the set {x(q) | q ∈ Q}. Similarly, let q*_y be the point in Q such that

\[
\sum_{y(q) < y(q^*_y),\; q \in Q} w(q) < W/2 \quad \text{and} \quad w(q^*_y) + \sum_{y(q) < y(q^*_y),\; q \in Q} w(q) \ge W/2.
\]

Let q* be the intersection of the vertical line x = x(q*_x) and the horizontal line y = y(q*_y).
We claim that q* is a global minimum point in the plane. To prove the claim, we first prove the
following Lemma 3, which generalizes the result in Lemma 1 to the plane. A monotone path in
the plane is a curve such that if we move from one endpoint of the curve to the other one, the
x-coordinate is monotonically changing (either increasing or decreasing) and the y-coordinate is
also monotonically changing (either increasing or decreasing).

Lemma 3 For any point p in the plane with p ≠ q*, if we move p towards q* along a monotone
path, then the expected distance Ed(p) is monotonically decreasing.

Proof: Without loss of generality, assume p is in the third quadrant with respect to q*, i.e.,
x(p) ≤ x(q*) and y(p) ≤ y(q*). Hence, as p moves along any monotone path π towards q*, both
x(p) and y(p) are monotonically increasing. The case where p is in other quadrants can be analyzed
similarly. According to the definition of Ed(p), we have

\[
\begin{aligned}
Ed(p) &= \sum_{q \in Q} w(q) \cdot d(p, q) = \sum_{q \in Q} w(q) \cdot (|x(p) - x(q)| + |y(p) - y(q)|) \\
&= \sum_{q \in Q} w(q) \cdot |x(p) - x(q)| + \sum_{q \in Q} w(q) \cdot |y(p) - y(q)|.
\end{aligned}
\]

Let Ed_x(p) = Σ_{q∈Q} w(q) · |x(p) - x(q)| and Ed_y(p) = Σ_{q∈Q} w(q) · |y(p) - y(q)|. Hence, Ed(p) =
Ed_x(p) + Ed_y(p). Intuitively, Ed_x(p) is the value of Ed(p) on the x-coordinate and Ed_y(p) is the
value of Ed(p) on the y-coordinate. In the sequel, by a similar approach as in the proof of Lemma 1,
we show that as p moves along π, both Ed_x(p) and Ed_y(p) are monotonically decreasing. We only
prove the case of Ed_x(p); the case of Ed_y(p) can be proved analogously.

At any moment during the movement, let Q_L be the subset of points in Q that are to the left of
or on the vertical line x = x(p), i.e., Q_L = {q | q ∈ Q and x(q) ≤ x(p)}. Let Q_R = Q \ Q_L. We
have

\[
\begin{aligned}
Ed_x(p) &= \sum_{q \in Q} w(q) \cdot |x(p) - x(q)| = \sum_{q \in Q_L} w(q) \cdot (x(p) - x(q)) + \sum_{q \in Q_R} w(q) \cdot (x(q) - x(p)) \\
&= \Big[ \sum_{q \in Q_L} w(q) - \sum_{q \in Q_R} w(q) \Big] \cdot x(p) - \sum_{q \in Q_L} w(q) \cdot x(q) + \sum_{q \in Q_R} w(q) \cdot x(q).
\end{aligned}
\]

Recall that q* is the intersection of the vertical line x = x(q*_x) and the horizontal line y = y(q*_y).
Since x(p) < x(q*), according to the definition of q*_x, Σ_{q∈Q_L} w(q) < W/2 < Σ_{q∈Q_R} w(q) always
holds. Further, as p moves on π, the value x(p) is monotonically increasing. Hence, as p moves, the
first term above, i.e., [Σ_{q∈Q_L} w(q) - Σ_{q∈Q_R} w(q)] · x(p), is monotonically decreasing. As p moves,
the set Q_L becomes monotonically larger and Q_R becomes monotonically smaller. Hence, the value
Σ_{q∈Q_L} w(q) · x(q) is monotonically increasing and the value Σ_{q∈Q_R} w(q) · x(q) is monotonically
decreasing. Therefore, as p moves on π towards q*, the value Ed_x(p) is monotonically decreasing.
The lemma thus follows. □
Lemma 3 shows that Ed(p) is a convex function with respect to p in the plane, and Ed(p) attains
its global minimum when p = q*. We have the following corollary.

Corollary 2 The point q* is a global minimum point in the plane. 
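
Since Ed(p) splits into Ed_x(p) + Ed_y(p), the point q* can be obtained from two independent one-dimensional weighted medians. The following sketch illustrates this; the function names are hypothetical, positive weights are assumed, and sorting is used in place of linear-time weighted selection.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def weighted_median(vals: List[float], ws: List[float]) -> float:
    """First value, in sorted order, at which the accumulated weight reaches W/2."""
    half, acc = sum(ws) / 2, 0
    for v, w in sorted(zip(vals, ws)):
        if acc + w >= half:
            return v
        acc += w
    raise ValueError("weights must be positive and the input non-empty")


def global_minimum(locs: List[Point], ws: List[float]) -> Point:
    """q* = (x(q*_x), y(q*_y)): coordinate-wise weighted medians of Q."""
    return (weighted_median([q[0] for q in locs], ws),
            weighted_median([q[1] for q in locs], ws))


def expected_distance(p: Point, locs: List[Point], ws: List[float]) -> float:
    """Ed(p) under the L_1 metric."""
    return sum(w * (abs(p[0] - q[0]) + abs(p[1] - q[1])) for q, w in zip(locs, ws))
```

Note that q* need not be one of the locations of Q, since its two coordinates may come from different points q*_x and q*_y.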

Next, we show how to find ψ(P) with the help of q*. We first introduce the minimal points and
the skyline, and give some observations.

3.2 The Minimal Points and the Skyline 

For each quadrant R of q*, we will find the ENN of Q in R ∩ P, and ψ(P) is the one of the four ENNs
with the minimum expected distance to Q. In the following, we focus on the first quadrant; the
algorithms for the other quadrants are very similar. Note that we view each quadrant as a closed
region that includes its two bounding half-lines (with the common endpoint q*).

Let P_1 be the subset of the points of P that are in the first quadrant, i.e., P_1 = {p | x(p) ≥
x(q*), y(p) ≥ y(q*), p ∈ P}. Our goal is to find ψ(P_1), i.e., the ENN of Q in P_1.

For any two different points p_1 and p_2 in P_1, we say that p_1 dominates p_2 if and only if
x(p_1) ≤ x(p_2) and y(p_1) ≤ y(p_2). A point p in P_1 is minimal if no point in P_1 dominates p (e.g., see
Fig. 1). If p_1 ∈ P_1 dominates p_2 ∈ P_1, then there exists a monotone path π connecting p_2 and q*
such that π contains p_1 (e.g., see Fig. 1). By Lemma 3, Ed(p_1) ≤ Ed(p_2). Therefore, to compute ψ(P_1),
we only need to consider the minimal points of P_1. Denote by M the set of minimal points in P_1.
Our discussion above leads to the following lemma.

Figure 1: The four (red) points connected by the dashed lines are minimal points, and the dashed lines connecting
them form a skyline. p_1 dominates p_2, and the dotted curve connecting q* and p_2 is a monotone path.

Lemma 4 ψ(P_1) is the point in M with the minimum expected distance to Q.

Based on Lemma 4, one tempting approach is to first find the set M and then choose the point
in M that is nearest to Q. In the 1-D case, this works quite well because M only has one point.
In contrast, here the set M may have O(n) points in the worst case, and thus we cannot afford to
check every point of M. Below, we use a different approach.
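
To make the tempting approach concrete, the following sketch computes M for the first quadrant by a left-to-right sweep and can be used to check Lemma 4 on small inputs. It is the straightforward O(n log n) method that the paper avoids at query time, not the algorithm developed below; the function names are hypothetical, and the general position assumption (distinct coordinates) is assumed.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def first_quadrant(P: List[Point], q_star: Point) -> List[Point]:
    """P_1: points of P in the closed first quadrant of q*."""
    return [p for p in P if p[0] >= q_star[0] and p[1] >= q_star[1]]


def minimal_points(P1: List[Point]) -> List[Point]:
    """M: points of P_1 not dominated by any other point of P_1.
    Sweeping by increasing x, a point is minimal iff its y-coordinate is
    smaller than every y-coordinate seen so far."""
    M, lowest_y = [], float("inf")
    for p in sorted(P1):
        if p[1] < lowest_y:
            M.append(p)
            lowest_y = p[1]
    return M  # ordered by increasing x and decreasing y
```

The returned list is ordered along the skyline, from its upper-left endpoint to its lower-right endpoint.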

For each point q in Q, we introduce a horizontal line through q and a vertical line through q.
Let A be the arrangement of the 2k lines introduced above. Each cell of A is a rectangle, possibly
with sides at infinity. Further, every point in Q is a vertex of a cell.

Consider any cell C of A. For any point p ∈ C, in the sequel, we will show that the expected
distance Ed(p) is a linear function with respect to x(p) and y(p). As discussed in [3], a consequence
of this is that the ENN of Q in P ∩ C is on the convex hull of the points in P ∩ C.

Denote by l_l, l_r, l_b, and l_t the lines containing the left, right, bottom, and top sides of C,
respectively. According to the definition of A, no point of Q lies strictly between l_l and l_r, and
similarly, no point of Q lies strictly between l_b and l_t. Let Q_l be the set of points in Q to the left of
or on l_l and let Q_r be the set of points in Q to the right of or on l_r. Let Q_b be the set of points in
Q below or on l_b and let Q_t be the set of points in Q above or on l_t. Hence, Q = Q_l ∪ Q_r and
Q = Q_b ∪ Q_t. We have the following lemma.

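Lemma 5 For any point p in the cell C, Ed(p) = C_a · x(p) + C_b · y(p) + C_c, where

\[
\begin{aligned}
C_a &= \sum_{q \in Q_l} w(q) - \sum_{q \in Q_r} w(q), \qquad C_b = \sum_{q \in Q_b} w(q) - \sum_{q \in Q_t} w(q), \\
C_c &= \sum_{q \in Q_r} w(q) x(q) - \sum_{q \in Q_l} w(q) x(q) + \sum_{q \in Q_t} w(q) y(q) - \sum_{q \in Q_b} w(q) y(q).
\end{aligned}
\]

Further, with O(k log k) time preprocessing on Q, given any cell C of A, we can compute C_a, C_b,
and C_c in O(log k) time.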

Proof: The first part (i.e., the values of C_a, C_b, and C_c) has been discussed in [3] and it can also
be easily verified by our analysis in the proof of Lemma 3. Hence, we omit the proof for it.

For the second part, given any cell C, our goal is to compute the three values C_a, C_b, and
C_c. Generally speaking, if, as preprocessing, we compute the prefix sums of the values w(q) and
w(q)x(q) in the sorted list of the points of Q by their x-coordinates, and compute the prefix sums
of w(q)y(q) in the sorted list of the points of Q by their y-coordinates, then C_a, C_b, and C_c can be
computed in O(log k) time. The details are given below.

To compute C_a, we need to know the value Σ_{q∈Q_l} w(q) and the value Σ_{q∈Q_r} w(q). Note that
Σ_{q∈Q_r} w(q) = W - Σ_{q∈Q_l} w(q). We can do the following preprocessing. We sort all points in Q by
their x-coordinates. Suppose the sorted list is q_1, q_2, ..., q_k from left to right. For each j, 1 ≤ j ≤ k,
we compute the value W_1(q_j) = Σ_{i=1}^{j} w(q_i). For any given cell C, let x_l be the x-coordinate of
the vertical line containing the left side of C. By binary search on the sorted list q_1, q_2, ..., q_k, in
O(log k) time, we can find the rightmost point q' in Q such that x(q') ≤ x_l. It is easy to see that
Σ_{q∈Q_l} w(q) = W_1(q'). Note that the above preprocessing takes O(k log k) time, and C_a can be
computed in O(log k) time.

In similar ways, we can compute C_b and C_c in O(log k) time, with O(k log k) time preprocessing.
Hence, the second part of the lemma follows. □
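
The prefix-sum preprocessing in the proof above can be sketched as follows. This is an illustration under assumptions rather than the paper's implementation: the class name `CellCoefficients` is hypothetical, the cell is identified by any point p in its interior instead of by its bounding lines, and prefix sums of w(q) are kept in both sorted orders so that C_b can be computed in the same way as C_a.

```python
import bisect
from itertools import accumulate
from typing import List, Tuple

Point = Tuple[float, float]


class CellCoefficients:
    def __init__(self, locs: List[Point], ws: List[float]):
        # O(k log k) preprocessing: sort Q by x and by y, then build prefix sums.
        by_x = sorted(zip(locs, ws), key=lambda t: t[0][0])
        by_y = sorted(zip(locs, ws), key=lambda t: t[0][1])
        self.xs = [q[0] for q, _ in by_x]
        self.ys = [q[1] for q, _ in by_y]
        self.pw_x = [0] + list(accumulate(w for _, w in by_x))          # sums of w(q)
        self.pwx = [0] + list(accumulate(w * q[0] for q, w in by_x))    # sums of w(q)x(q)
        self.pw_y = [0] + list(accumulate(w for _, w in by_y))
        self.pwy = [0] + list(accumulate(w * q[1] for q, w in by_y))    # sums of w(q)y(q)

    def coefficients(self, p: Point) -> Tuple[float, float, float]:
        """Return (C_a, C_b, C_c) for the cell whose interior contains p."""
        i = bisect.bisect_left(self.xs, p[0])   # |Q_l|: locations left of the cell
        j = bisect.bisect_left(self.ys, p[1])   # |Q_b|: locations below the cell
        W = self.pw_x[-1]
        c_a = self.pw_x[i] - (W - self.pw_x[i])
        c_b = self.pw_y[j] - (W - self.pw_y[j])
        c_c = ((self.pwx[-1] - self.pwx[i]) - self.pwx[i]
               + (self.pwy[-1] - self.pwy[j]) - self.pwy[j])
        return c_a, c_b, c_c
```

For a point p in the cell, `c_a * p[0] + c_b * p[1] + c_c` should then agree with the directly computed Ed(p), which gives a simple way to test the sketch.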

As discussed in [3], Lemma 5 implies that ψ(P ∩ C), i.e., the ENN of Q in P ∩ C, is on the
convex hull of the points in P ∩ C. More specifically, ψ(P ∩ C) is an extreme point of P ∩ C along a
certain direction that is determined by C_a and C_b, and thus we can do binary search on the convex
hull to find ψ(P ∩ C).

To find ψ(P_1), the algorithm in [3] checks every cell of A in the first quadrant of q*, and for each
cell C, it finds ψ(P_1 ∩ C) by doing binary search on the convex hull of the points in P_1 ∩ C. Since
each quadrant of q* may have Θ(k^2) cells, the approach in [3] runs in Ω(k^2) time. In contrast, we
show that we only need to check O(k) cells based on Lemma 4. In addition, we use the compact
interval trees [14] to (implicitly) compute the convex hulls in a faster way than that in [3].

Although the number of minimal points in M can be O(n), we show below that the number of
cells of A that contain these minimal points is O(k), and further, we can find these cells efficiently.

If we order the points in M by their x-coordinates and connect every pair of adjacent points by
a line segment, then we obtain a path π_M, which we call a skyline (e.g., see Fig. 1). Let p_l be
the leftmost point in M and p_r be the rightmost point in M. Then, p_l and p_r are the two endpoints
of π_M. Further, if we move from p_l to p_r on π_M, then the x-coordinate is monotonically increasing and
the y-coordinate is monotonically decreasing (e.g., see Fig. 1). Hence, π_M is a monotone path.

Lemma 6 The number of cells of A containing the minimal points in M is O(k).

Proof: By our general position assumption that no two points in P ∪ Q have the same
x-coordinate or y-coordinate, each edge of π_M is neither horizontal nor vertical. Because π_M is a
monotone path, each line of A can intersect π_M at most once. Hence, the number of intersections
between π_M and A is O(k), which implies that the number of cells that intersect π_M is O(k). Since
all points in M are on π_M, the lemma follows. □

Denote by C_M the set of cells of A that contain the minimal points in M. Next, we give an
algorithm to compute C_M. A straightforward way is to first compute A and then traverse A by
following the skyline π_M. But this approach is not efficient because: (1) computing A takes Θ(k^2)
time; (2) the size of π_M may be O(n) due to |M| = O(n) in the worst case. Below, in Lemma 7, we
propose an O(k log n + k log k) time algorithm with certain preprocessing.

First of all, we sort all points in Q by their x-coordinates and y-coordinates, respectively.
Accordingly, we obtain a sorted list for the horizontal lines of A and a sorted list for the vertical
lines of A. With these two sorted lists, given any point p, we can determine the cell of A that
contains p in O(log k) time by doing binary search on both sorted lists.
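
The point location step can be sketched with two binary searches, and combined with a brute-force skyline computation to observe the O(k) bound of Lemma 6 on small inputs. The function names and the (column, row) indexing of cells are hypothetical conventions of this sketch, and general position is assumed.

```python
import bisect
from typing import List, Set, Tuple

Point = Tuple[float, float]


def locate_cell(p: Point, xs_sorted: List[float], ys_sorted: List[float]) -> Tuple[int, int]:
    """Cell of A containing p, as (column index, row index), in O(log k) time.
    xs_sorted / ys_sorted are the sorted x- and y-coordinates of the points of Q."""
    return (bisect.bisect_left(xs_sorted, p[0]),
            bisect.bisect_left(ys_sorted, p[1]))


def minimal_points(P1: List[Point]) -> List[Point]:
    """Brute-force skyline of P_1 (sweep by increasing x)."""
    M, lowest_y = [], float("inf")
    for p in sorted(P1):
        if p[1] < lowest_y:
            M.append(p)
            lowest_y = p[1]
    return M


def cells_of_minimal_points(P: List[Point], locs: List[Point], q_star: Point) -> Set[Tuple[int, int]]:
    """C_M for the first quadrant of q*, computed naively in O(n log n + |M| log k)."""
    xs_sorted = sorted(q[0] for q in locs)
    ys_sorted = sorted(q[1] for q in locs)
    P1 = [p for p in P if p[0] >= q_star[0] and p[1] >= q_star[1]]
    return {locate_cell(p, xs_sorted, ys_sorted) for p in minimal_points(P1)}
```

Because the skyline is monotone, the column index never decreases and the row index never increases along it, so the returned set has at most 2k + 1 cells.
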
Lemma 7 With O(n log n) time and O(n) space preprocessing on P, we can compute the set C_M
in O(k log n + k log k) time.

Proof: One operation frequently used in our algorithm for computing C_M is the following
segment-dragging query. Given any line segment s that is either horizontal or vertical, we move s
along a given direction perpendicular to s; the query asks for the first point of P hit by s or reports
that no such point exists. Chazelle [6] constructed an O(n)-size data structure in O(n log n) time such
that each segment-dragging query can be answered in O(log n) time. As preprocessing, we build
such a data structure on P.

We call the region between any two adjacent vertical lines in A a column (including the two
bounding lines). Let D_M denote the set of columns of A each of which contains at least one cell
of C_M. We search the columns of D_M from left to right one by one. For each column D ∈ D_M, we
search the cells of C_M in D in a bottom-up fashion. After the search on D is done, we proceed
to the next column of D_M to the right. The details are given below.

Note that due to the general position assumption that no two points in P ∪ Q have the same
x- or y-coordinate, every point of P is in the interior of a cell of A.

We first determine the leftmost column of D_M, denoted by D, as follows. Let p_0 be the leftmost
point of M. It is easy to see that the column containing p_0 is D (e.g., see Fig. 2). Hence, after
having p_0, D can be determined in O(log k) time by binary search on the sorted list of the vertical
lines of A. We determine the point p_0 by the following segment-dragging query. Consider a vertical
segment s_0 = q*b on the vertical line x = x(q*), where y(b) = +∞ (we may also set y(b) to the
y-coordinate of the highest point of P). In other words, q*b is the vertical half-line bounding the
first quadrant of q*. Imagine that we drag s_0 rightwards (i.e., horizontally to the right). Then, p_0
is the first point of P hit by s_0. By using the segment-dragging query data structure on P, p_0 can
be found in O(log n) time.
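
The semantics of the query used to find p_0 can be sketched with a naive linear scan. This is only a stand-in that shows what a segment-dragging query returns; it takes O(n) time per query and is not Chazelle's O(log n) data structure. The function name `drag_right` and its parameters are hypothetical.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float]


def drag_right(P: List[Point], x0: float, y_lo: float, y_hi: float) -> Optional[Point]:
    """First point of P hit when the vertical segment {x0} x [y_lo, y_hi] is
    dragged horizontally to the right, or None if no point is hit."""
    hits = [p for p in P if p[0] >= x0 and y_lo <= p[1] <= y_hi]
    return min(hits, key=lambda p: p[0]) if hits else None


def leftmost_minimal_point(P: List[Point], q_star: Point) -> Optional[Point]:
    """p_0: drag the half-line s_0 = q*b (with y(b) = +infinity) rightwards."""
    return drag_right(P, q_star[0], q_star[1], math.inf)
```

The point returned for s_0 is the leftmost point of P in the first quadrant of q*, and no other point of that quadrant can dominate it, which is why it is the leftmost point of M.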

After p_0 is found, we determine the column D as discussed above in O(log k) time. Further,
notice that the cell of A that contains p_0 is the highest cell in D ∩ C_M, and we denote it by C_h
(e.g., see Fig. 2). In the sequel, we search the column D in a bottom-up manner to find all cells of
C_M ∩ D. More specifically, we first find the lowest cell of C_M ∩ D and then find the second lowest
cell of C_M ∩ D. This searching procedure continues until we meet the highest cell C_h. The details
are given below.

We first determine the lowest cell C in C_M ∩ D. To this end, we use another segment-dragging
query as follows. Let s_1 be the line segment that is the intersection of the column D and the
horizontal line y = y(q*). Imagine that we drag s_1 upwards, and let p_1 be the first point of P hit
by s_1 (e.g., see Fig. 2). Observe that C is the cell that contains p_1. Hence, after p_1 is found by the
segment-dragging query in O(log n) time, C can be determined in additional O(log k) time.

We proceed to determine the next cell C' in C_M ∩ D that is higher than C, as follows. We
first determine the leftmost point p_2 in C ∩ P (e.g., see Fig. 2), which can be done again by a
segment-dragging query as follows. Let s_2 be the left side of C. The point p_2 is the first point in
P hit by dragging s_2 rightwards. If p_2 is the point p_0, then we know that C is C_h, in which case
the search on the column D is done. Below, we assume p_2 is not p_0.

The vertical line through p_2 partitions the column D into two vertical sub-columns; denote
by D_l the left sub-column. Let p_3 be the lowest point in P ∩ D_l (e.g., see Fig. 2). Let C'' be the
cell containing p_3. We claim that C'' is C'. We prove the claim in the next paragraph.

Indeed, since x(p_3) < x(p_2) and p_2 is the leftmost point in P ∩ C, C'' cannot be C. Thus, C'' is
higher than C. On the other hand, suppose to the contrary that C'' is not C'. Then, C' is above
C and below C''. Also, C' ∩ D_l must contain a point of M since otherwise all minimal points in
C' ∩ M are dominated by p_2, contradicting the fact that C' ∈ C_M contains minimal points of M. Since
C' is lower than C'' and C' ∩ D_l contains minimal points of M, this contradicts the fact that p_3 ∈ C''
is the lowest point in P ∩ D_l. Hence, we conclude that C'' is C'.

Figure 2: Illustrating the algorithm in Lemma 7. The dashed grid is A. The (red) dotted vertical line through p_2
does not belong to A.

To determine C', it is sufficient to find p_3, which again can be done by a segment-dragging
query, as follows. Let s_3 be the line segment that is the intersection of the sub-column D_l and the
horizontal line containing the top side of C. If we drag s_3 upwards, the point p_3 is the first point
in P hit by s_3. Therefore, we can determine C' in O(log n + log k) time.

We continue this procedure to search the cells in C_M ∩ D until we meet the highest cell C_h.

After the search on the column D is done, we proceed to the next column D' of D_M to the right.
We first determine D' by another segment-dragging query as follows. Recall that p_1 is the lowest
point in P ∩ D. Let s_4 be the vertical line segment on the right bounding line of D where the lower
endpoint of s_4 is on the horizontal line y = y(q*) and the upper endpoint has the same y-coordinate
as p_1. If we drag the segment s_4 rightwards, let p_4 be the first point of P hit by s_4 (e.g., see Fig. 2).
Then, it is not difficult to see that p_4 is in M and the column of A containing p_4 is D'. Further,
the cell of A containing p_4 is the highest cell in C_M ∩ D'. Hence, after p_4 is found, D' and the
highest cell in C_M ∩ D' can be determined in O(log k) time. Next, we proceed to search all cells
in C_M ∩ D' in a bottom-up manner, in the same way as in the column D. Note that if the above
segment-dragging query fails to find any point (i.e., such a point p_4 does not exist), then all cells
of C_M have been found, and we terminate the algorithm.

For the running time, as shown above, for each cell in $\mathcal{C}_M$, the algorithm spends $O(\log n + \log k)$
time. Since $|\mathcal{C}_M| = O(k)$ (by Lemma 6), computing $\mathcal{C}_M$ takes $O(k \log n + k \log k)$ time.

Clearly, the preprocessing needs $O(n \log n)$ time and $O(n)$ space. The lemma thus follows. □

Due to Lemma U the following lemma is obvious. 

Lemma 8 The ENN $\psi(P_1)$ is in one of the cells of $\mathcal{C}_M$.

3.3 Computing the ENN

To compute $\psi(P_1)$, once we have the set $\mathcal{C}_M$, we compute the ENN of $Q$ in $C \cap P$ for each cell
$C \in \mathcal{C}_M$. By Lemma 8, among the $O(k)$ ENNs found above, the one minimizing the expected
distance to $Q$ is $\psi(P_1)$. The key is to compute the ENN of $Q$ in each cell $C \in \mathcal{C}_M$ efficiently. An
$O(n \log^2 n)$-size data structure is given in [3] that can be built in $O(n \log^2 n)$ time and can compute
the ENN in each cell $C \in A$ in $O(\log^3 n)$ time. By using compact interval trees [14], we obtain
the following improved result.

Lemma 9 An $O(n \log n \log\log n)$-size data structure can be built in $O(n \log n \log\log n)$ time, such
that given any cell $C \in A$, the ENN of $Q$ in $P \cap C$ can be computed in $O(\log^2 n)$ time.

Proof: Our data structure uses the compact interval tree [2], which was designed for the
sub-path hull queries studied in [14], defined as follows. Let $\pi$ be a simple path of $n$ vertices in the
plane, and suppose the vertices are $v_1, v_2, \ldots, v_n$ ordered along $\pi$. Given two vertex indices $i$ and $j$
with $i < j$, the sub-path hull query asks for the convex hull of the vertices $v_i, v_{i+1}, \ldots, v_j$. A compact
interval tree data structure is given in [13]; for each sub-path hull query, in $O(\log n)$ time, it can report
a data structure that (implicitly) represents the convex hull such that any standard binary-search based
operation on the convex hull can be implemented in $O(\log n)$ time (e.g., finding an extreme point on the
convex hull along any given direction). The compact interval tree is of $O(n \log\log n)$ size and can
be built in $O(n \log\log n)$ time after the vertices of $\pi$ are sorted by their $x$- or $y$-coordinates.
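As a reference for what a sub-path hull query returns, the following Python sketch computes the hull of $v_i, \ldots, v_j$ from scratch with Andrew's monotone chain and then scans it for an extreme point. This is a brute-force stand-in, taking $O(m \log m)$ time for a sub-path of $m$ vertices, and not the compact interval tree itself; the names `subpath_hull` and `extreme_point` and the 0-based indexing are assumptions of this illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]

def cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of (a - o) x (b - o); positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(pts: Sequence[Point]) -> List[Point]:
    """Andrew's monotone chain; hull vertices in counter-clockwise order."""
    pts = sorted(set(pts))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    upper: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]

def subpath_hull(path: Sequence[Point], i: int, j: int) -> List[Point]:
    """Convex hull of path[i], ..., path[j] (inclusive, 0-based)."""
    return convex_hull(path[i:j + 1])

def extreme_point(hull: Sequence[Point], direction: Point) -> Point:
    """Hull vertex farthest along `direction` (linear scan, not binary search)."""
    dx, dy = direction
    return max(hull, key=lambda p: p[0] * dx + p[1] * dy)
```

The compact interval tree answers the same two questions without rebuilding the hull, which is where the $O(\log n)$ bounds quoted above come from.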

Our data structure for the lemma is constructed as follows. At a high level, it is similar to
the two-dimensional orthogonal range tree [13]. A balanced binary search tree $T$ is built based on
the $x$-coordinates of the points in $P$. The leaves of $T$ store the points of $P$ in sorted order from
left to right, and the internal nodes store splitting values to guide the search on $T$. Each node
$v$ of $T$ also stores the subset $P(v)$ of points of $P$ in the subtree of $T$ rooted at $v$, and $P(v)$ is
called the canonical subset of $v$. For each canonical subset $P(v)$, we build a compact interval tree
in the following way. If we sort the points of $P(v)$ by their $y$-coordinates and connect each pair of
adjacent points in the sorted list by a line segment, then we obtain a path $\pi(v)$. The points in
$P(v)$ are the vertices of $\pi(v)$. Note that $\pi(v)$ is a simple path and each horizontal line intersects $\pi(v)$
at most once. We build a compact interval tree data structure on $\pi(v)$ using the approach in [14].
This finishes the construction of our data structure.

For the preprocessing time and space, for each canonical subset $P(v)$, constructing the compact
interval tree data structure on $\pi(v)$ takes $O(m \log\log m)$ time and space, where $m = |P(v)|$.
Note that the $y$-sorted list of $P(v)$ can be built during the construction of $T$ in a bottom-up
manner. Hence, the whole data structure takes $O(n \log n \log\log n)$ space and can be constructed
in $O(n \log n \log\log n)$ time.
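The bound can be checked level by level. The following short derivation assumes, as is standard for a balanced range tree, that $T$ has $O(\log n)$ levels and that the canonical subsets of the nodes on any one level are pairwise disjoint, so their sizes sum to at most $n$.

```latex
\sum_{v \in T} |P(v)| \log\log |P(v)|
  \;\le\; \sum_{\ell = 0}^{O(\log n)} \; \sum_{v \text{ at level } \ell} |P(v)| \log\log n
  \;\le\; \sum_{\ell = 0}^{O(\log n)} n \log\log n
  \;=\; O(n \log n \log\log n).
```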

Given any cell $C \in A$, which is a rectangle, our goal is to find the ENN of $Q$ in $C \cap P$. Essentially,
we are looking for an extreme point in $C \cap P$ along a certain direction. As discussed in [3], this
direction is determined by the two factors $C_a$ and $C_b$ as defined in Lemma 5. We assume that we
already know this direction (e.g., $C_a$, $C_b$, and $C_c$ can be computed in $O(\log k)$ time by Lemma 5).
Denote by $a$ the above direction.

Let $x_l$ and $x_r$ be the $x$-coordinates of the two vertical lines bounding $C$, respectively, with
$x_l < x_r$. Let $y_b$ and $y_t$ be the $y$-coordinates of the two horizontal lines bounding $C$, respectively,
with $y_b < y_t$. Using the range $[x_l, x_r]$, we first find the $O(\log n)$ canonical subsets whose union is
the set of points in $P$ between the two vertical lines $x = x_l$ and $x = x_r$. For each such canonical
subset $P(v)$, we use the range $[y_b, y_t]$ to determine the sub-path of $\pi(v)$ contained in $C$, which can
be done by binary search on the $y$-sorted list of $P(v)$; subsequently, we use the compact interval
tree data structure on $\pi(v)$ to (implicitly) report the convex hull of the sub-path, after which we
search for the extreme point on the convex hull along the direction $a$ in $O(\log n)$ time. In this way,
we obtain $O(\log n)$ extreme points for the $O(\log n)$ canonical subsets, and the one minimizing the
expected distance to $Q$ is the ENN of $Q$ in $C \cap P$. Note that since we already have the three factors
$C_a$, $C_b$, and $C_c$ as defined in Lemma 5, for each extreme point found above, its expected distance
to $Q$ can be computed in constant time.

Therefore, the ENN of $Q$ in $C \cap P$ can be found in $O(\log^2 n)$ time. □
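The query in the proof can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the data structure of Lemma 9: each node keeps its canonical subset sorted by $y$, but the compact interval tree is replaced by a linear scan of the $y$-range, and every node re-sorts its subset instead of merging bottom-up. The sketch also assumes that, inside a cell, the expected distance of a point $p$ is the linear function $C_a \, x(p) + C_b \, y(p) + C_c$, which is how the three factors are used here; the names `Node`, `build`, and `enn_in_cell` are hypothetical.

```python
from bisect import bisect_left, bisect_right

class Node:
    """Range-tree node over a non-empty list of points sorted by x."""
    def __init__(self, pts_by_x):
        self.x_min = pts_by_x[0][0]
        self.x_max = pts_by_x[-1][0]
        # Canonical subset P(v) in y-sorted order, i.e. the path pi(v).
        self.by_y = sorted(pts_by_x, key=lambda p: (p[1], p[0]))
        self.ys = [p[1] for p in self.by_y]
        self.left = self.right = None
        if len(pts_by_x) > 1:
            mid = len(pts_by_x) // 2
            self.left = Node(pts_by_x[:mid])
            self.right = Node(pts_by_x[mid:])

def build(points):
    return Node(sorted(points))

def canonical_nodes(v, x_l, x_r, out):
    """Collect maximal nodes whose points all have x in [x_l, x_r]."""
    if v is None or v.x_max < x_l or v.x_min > x_r:
        return
    if x_l <= v.x_min and v.x_max <= x_r:
        out.append(v)
        return
    canonical_nodes(v.left, x_l, x_r, out)
    canonical_nodes(v.right, x_l, x_r, out)

def enn_in_cell(root, cell, c_a, c_b, c_c):
    """Return (expected distance, point) minimizing c_a*x + c_b*y + c_c
    over the points inside cell = (x_l, x_r, y_b, y_t), or None if empty."""
    x_l, x_r, y_b, y_t = cell
    nodes = []
    canonical_nodes(root, x_l, x_r, nodes)
    best = None
    for v in nodes:
        # Binary search for the sub-path of pi(v) inside the cell.
        lo = bisect_left(v.ys, y_b)
        hi = bisect_right(v.ys, y_t)
        # Linear scan here; the lemma uses a hull search in O(log n).
        for p in v.by_y[lo:hi]:
            d = c_a * p[0] + c_b * p[1] + c_c
            if best is None or d < best[0]:
                best = (d, p)
    return best
```

Replacing the inner scan with an extreme-point search on the sub-path hull is what brings the per-subset cost down to $O(\log n)$ and the whole query to $O(\log^2 n)$.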

3.4 Wrapping Things Up 

We summarize our data structure and algorithm for computing $\psi(P, Q)$.

Our preprocessing on $P$ includes the following procedures. (1) Sort all points in $P$ by their
$x$-coordinates and $y$-coordinates, respectively. (2) Construct the segment-dragging query data
structure on $P$. (3) Build the data structure for Lemma 9, i.e., the range tree with compact
interval trees as the secondary data structures. The total time and space are dominated by (3),
i.e., $O(n \log n \log\log n)$ time and $O(n \log n \log\log n)$ space.

Given any uncertain query point $Q$, our query algorithm for computing $\psi(P, Q)$ includes the
following steps. (1) Sort all points in $Q$ by their $x$-coordinates and $y$-coordinates, respectively.
(2) Process $Q$ as in Lemma 5. (3) Compute the global minimum point $q^*$. (4) Divide the plane
into four quadrants with respect to $q^*$. In each quadrant $R$, we find the ENN of $Q$ in $P \cap R$ in
the following way. Suppose $R$ is the first quadrant. (4.1) Find the set $\mathcal{C}_M$ by Lemma 7. (4.2) For
each cell $C$ in $\mathcal{C}_M$, find the ENN of $Q$ in $P \cap C$ by Lemma 9; among the $O(k)$ ENNs, the one with
the minimum expected distance to $Q$ is the ENN of $Q$ in $R \cap P$. (5) Among the four ENNs found
from the four quadrants of $q^*$, the one with the minimum expected distance to $Q$ is $\psi(P, Q)$. For the
running time of the query algorithm, the first three steps can be done in $O(k \log k)$ time; Step (4)
can be done in $O(k \log^2 n + k \log k)$ time. Hence, the total time is bounded by $O(k \log^2 n + k \log k)$.
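The bound for Step (4) can be unpacked from the earlier lemmas. The accounting below assumes that, for each of the $O(k)$ cells, the factors of Lemma 5 are obtained in $O(\log k)$ time before the $O(\log^2 n)$ query of Lemma 9 is issued.

```latex
\underbrace{O(k \log n + k \log k)}_{\text{Step (4.1), Lemma 7}}
\;+\;
\underbrace{O(k) \cdot \bigl( O(\log k) + O(\log^2 n) \bigr)}_{\text{Step (4.2), Lemmas 5 and 9}}
\;=\; O(k \log^2 n + k \log k).
```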

In summary, we have the following theorem. 

Theorem 2 Given a set $P$ of $n$ exact points in the plane, a data structure of $O(n \log n \log\log n)$
size can be built in $O(n \log n \log\log n)$ time so that for any uncertain query point $Q$, the ENN
$\psi(P, Q)$ can be found in $O(k \log^2 n + k \log k)$ time.
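For checking an implementation of the theorem on small inputs, a brute-force baseline is useful. The Python sketch below computes the ENN directly from the definition in $O(nk)$ time, with no preprocessing. The function names and the representation of $Q$ as a list of (location, probability) pairs are assumptions of this illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
UncertainQuery = List[Tuple[Point, float]]  # (location, probability) pairs

def expected_l1(p: Point, query: UncertainQuery) -> float:
    """Expected L1 distance from the exact point p to the uncertain query."""
    return sum(w * (abs(p[0] - q[0]) + abs(p[1] - q[1])) for q, w in query)

def brute_force_enn(points: Sequence[Point], query: UncertainQuery) -> Point:
    """Point of `points` minimizing the expected L1 distance to the query."""
    return min(points, key=lambda p: expected_l1(p, query))
```

For example, with points `[(0, 0), (3, 3), (10, 0)]` and a query that is at `(2, 2)` or `(4, 4)` with probability 0.5 each, the expected distances are 6, 2, and 10, so the ENN is `(3, 3)`.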

4 Conclusion 

In this paper, we present improved results on nearest neighbor queries in the plane where the data
is exact and the query is uncertain under the $L_1$ distance metric. Our improvements are based
on two aspects: a deeper understanding of the underlying geometric properties, and the use of
more advanced data structures. We also present an efficient data structure for the
same problem in the one-dimensional space where the distance is measured by either the $L_1$ or $L_2$
metric.